Update probe-interval and stale contact point timeout calculation #2601

Arkatufus · 2024-06-27T17:55:45Z

Changes

Change the probe-interval default value to 5s
Move all effective value calculations out of the settings file
Remove the stale-contact-point-timeout HOCON setting
Change the stale contact point timeout calculation inside BootstrapCoordinator to (ProbeInterval + ProbingFailureTimeout); it was ProbingFailureTimeout before.
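
For illustration, the arithmetic behind the last item can be sketched as below. This is a minimal sketch, not the code in BootstrapCoordinator: the helper names are hypothetical, and the 3s probing failure timeout is the value mentioned later in this conversation.

```csharp
using System;

// Values taken from this conversation: probe-interval now defaults to 5s,
// and HTTP probing is described as having a 3s failure timeout.
var probeInterval = TimeSpan.FromSeconds(5);
var probingFailureTimeout = TimeSpan.FromSeconds(3);

var before = StaleTimeoutBefore(probingFailureTimeout);
var after = StaleTimeoutAfter(probeInterval, probingFailureTimeout);

Console.WriteLine($"before: {before.TotalSeconds}s, after: {after.TotalSeconds}s");
// prints "before: 3s, after: 8s"

// Hypothetical helpers, named for illustration only.
static TimeSpan StaleTimeoutBefore(TimeSpan failureTimeout) => failureTimeout;

static TimeSpan StaleTimeoutAfter(TimeSpan interval, TimeSpan failureTimeout)
    => interval + failureTimeout;
```

With these values an entry is kept for 8s after its last update instead of 3s, so the window now covers one full probe interval plus the time a probe is allowed to take.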

Aaronontheweb

Question - if we have a bad probe value between intervals, i.e.

0:05 - hit node, it's good
0:10 - hit node, it's good
0:15 - no response from node
0:16 - join decider runs

Will that bad contact in step 3 invalidate the good contacts from earlier? It should, otherwise this will damage the algorithm. What happens there?

Aaronontheweb · 2024-06-28T14:56:14Z

src/management/Akka.Management/Cluster/Bootstrap/Internal/BootstrapCoordinator.cs

+            }
+            else
+            {
+                _staleContactPointTimeout = new TimeSpan((cps.ProbeInterval.Ticks + cps.ProbingFailureTimeout.Ticks) * 2);


Arkatufus · 2024-06-28T15:06:57Z

First of all, ContactPoint HTTP probing is done in parallel, so the first and second probes would complete close together, followed by the failure.

Second, it depends on the RequiredContactPointNr and ContactWithAllContactPoints settings.

If RequiredContactPointNr is 2 and ContactWithAllContactPoints is false, the third failure is ignored and the cluster forms immediately after the first 2 succeed.
If RequiredContactPointNr is 2 and ContactWithAllContactPoints is true, the third failure blocks cluster formation until that entry is removed, either because it went stale or because it was dropped in the next round of Discovery resolution.
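
The two cases above can be sketched as a small predicate. This is a simplified model under stated assumptions, not the actual join decider: `CanFormCluster` and the `bool[]` of probe results are hypothetical, and the real decider takes more into account than these two settings.

```csharp
using System;
using System.Linq;

// Hypothetical model: one bool per discovered contact point,
// true if its HTTP probe succeeded.
static bool CanFormCluster(
    bool[] probeResults,
    int requiredContactPointNr,
    bool contactWithAllContactPoints)
{
    var reachable = probeResults.Count(ok => ok);

    // Not enough confirmed contact points yet.
    if (reachable < requiredContactPointNr) return false;

    // When every contact point must be contacted, a single failure blocks.
    return !contactWithAllContactPoints || reachable == probeResults.Length;
}

// Two good probes and one failure, as in the question.
var results = new[] { true, true, false };

Console.WriteLine(CanFormCluster(results, 2, contactWithAllContactPoints: false)); // True
Console.WriteLine(CanFormCluster(results, 2, contactWithAllContactPoints: true));  // False
```

In the second case the result only flips once the failed entry leaves the list, which is the removal path described above.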

Arkatufus · 2024-06-28T15:22:32Z

Join decider ticks, discovery resolution ticks, and HTTP probing ticks each run on their own timer.

The decider tick and HTTP probing tick timer intervals are both set to contact-point.probe-interval,
decider ticks run in the BootstrapCoordinator,
HTTP probing runs in the HttpContactPointBootstrap,
HTTP probing has a failure timeout of 3s

The discovery resolution tick timer interval is set to contact-point-discovery.interval; it runs in the BootstrapCoordinator

Aaronontheweb · 2024-06-28T15:28:03Z

> the third failure blocks cluster formation until that entry is removed, either because it went stale or because it was dropped in the next round of Discovery resolution.

The question I'm really asking is: if a node somehow dies shortly after being successfully pinged, but before a cluster is formed, is it possible to form a cluster with fewer live nodes than RequiredContactPointNr because of a stale entry?

Arkatufus · 2024-06-28T15:54:05Z

I guess this is where our implementation and the Scala implementation diverge with this update.
In the Scala implementation, there is no stale contact point timeout setting; a possibly dead node entry is guarded and pruned by a TTL equal to the probe-failure-timeout value. If an HTTP probe dies without updating its entry in the contact point list, the coordinator is guaranteed to remove the entry at the same interval as the probe timeout interval.

Here, we're lengthening that value, so there is a possible race condition where a contact point is actually dead but has not yet been pruned from the contact point list, if that makes any sense.
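
To make the race concrete, here is a minimal sketch that applies a TTL check to the timeline from the question earlier in this thread (last good response at 0:10, decider run at 0:16). `IsStale` is a hypothetical helper, and the comparison is an assumption about how pruning works, not a copy of the coordinator's code.

```csharp
using System;

// Lengthened TTL: ProbeInterval (5s) + ProbingFailureTimeout (3s).
var lengthenedTtl = TimeSpan.FromSeconds(5) + TimeSpan.FromSeconds(3);

// TTL equal to the probe failure timeout alone (3s).
var shortTtl = TimeSpan.FromSeconds(3);

// Timeline from the question, as offsets from 0:00.
var lastGoodResponse = TimeSpan.FromSeconds(10);
var deciderRuns = TimeSpan.FromSeconds(16);

// The entry is 6s old when the decider runs.
Console.WriteLine(IsStale(lastGoodResponse, deciderRuns, lengthenedTtl)); // False
Console.WriteLine(IsStale(lastGoodResponse, deciderRuns, shortTtl));      // True

// Hypothetical helper: an entry is stale once its age exceeds the TTL.
static bool IsStale(TimeSpan lastSeen, TimeSpan now, TimeSpan ttl)
    => now - lastSeen > ttl;
```

Under these assumptions the 8s TTL still counts the dead node as a live contact at 0:16, while the 3s TTL would already have pruned it.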

Aaronontheweb · 2024-06-28T16:12:46Z

> a possibly dead node entry is guarded and pruned by a TTL equal to the probe-failure-timeout value. If an HTTP probe dies without updating its entry in the contact point list, the coordinator is guaranteed to remove the entry at the same interval as the probe timeout interval.

So we should redesign this feature then - no need for a separate stale value, just always slide the timeout with the interval (i.e. it's never a hard timeout, it's always a +n value). That'd be even simpler.

Arkatufus · 2024-06-28T16:22:02Z

~~We'd need to remove 1.5.26 from NuGet then~~
Never mind, I somehow mixed up Akka.Management and Akka.NET again... geez

Arkatufus · 2024-06-28T16:45:49Z

ok, done

Aaronontheweb

LGTM

Aaronontheweb · 2024-07-01T21:04:33Z

src/management/Akka.Management/Cluster/Bootstrap/Internal/BootstrapCoordinator.cs

-            {
-                _staleContactPointTimeout = new TimeSpan((cps.ProbeInterval.Ticks + cps.ProbingFailureTimeout.Ticks) * 2);
-            }
+            _staleContactPointTimeout = cps.ProbeInterval + cps.ProbingFailureTimeout;


Arkatufus and others added 4 commits June 28, 2024 00:50

Update probe-interval and stale contact point timeout calculation

e31fca1

Merge branch 'dev' into update-probe-interval

bf008ef

Merge branch 'dev' into update-probe-interval

56c502f

Merge branch 'dev' into update-probe-interval

ae706a0

Aaronontheweb reviewed Jun 28, 2024

Merge branch 'dev' into update-probe-interval

916faca

Remove stale-contact-point-timeout setting

c3375f2

Merge branch 'dev' into update-probe-interval

0633682

Aaronontheweb approved these changes Jul 1, 2024

Aaronontheweb merged commit c7328c6 into akkadotnet:dev Jul 1, 2024
3 checks passed

Arkatufus mentioned this pull request Jul 2, 2024

Update RELEASE_NOTES.md for 1.5.26-beta3 release #2617

Merged

Arkatufus deleted the update-probe-interval branch July 15, 2024 20:33

Arkatufus mentioned this pull request Jul 15, 2024

Update RELEASE_NOTES.md for 1.5.26 release #2655

Merged
